Computational Learning of Probabilistic Grammars in the Unsupervised Setting
Author
Abstract
With the rising amount of available multilingual text data, computational linguistics faces both an opportunity and a challenge. This text can enrich the domains of NLP applications and improve their performance, but traditional supervised learning for this kind of data would require annotating part of it to induce natural language structure, and for such large amounts of rich text the annotation task can be daunting and expensive. Unsupervised learning of natural language structure can obviate the need for such annotation. Natural language structure can be modeled with probabilistic grammars, generative statistical models that are useful for compositional and sequential structures. Probabilistic grammars are widely used in natural language processing, and also in other fields such as computer vision, computational biology, and cognitive science. This dissertation presents a theoretical and an empirical analysis of learning these widely used grammars in the unsupervised setting. We analyze the computational properties involved in estimating probabilistic grammars: the computational complexity of the inference problem and the sample complexity of the learning problem. We show that the common inference problems for probabilistic grammars are computationally hard, even though a polynomial number of samples is sufficient for accurate estimation. We also give a variational inference framework for estimating probabilistic grammars in the empirical Bayesian setting, which permits the use of non-conjugate priors with probabilistic grammars as well as parallelizable inference. The estimation techniques we use rely on two types of priors on probabilistic grammars: logistic normal priors and adaptor grammars. We further extend the logistic normal priors to shared logistic normal priors, which define a distribution over the collection of multinomials that represents a probabilistic grammar. We test our estimation techniques on treebanks in eleven languages; our empirical evaluation shows that they are useful and perform better than several Bayesian and non-Bayesian baselines.
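To make the logistic normal prior concrete, here is a minimal sketch using the standard logistic normal construction; the symbols K, μ, and Σ are illustrative notation, not taken from the dissertation. A draw over the K probabilities of one multinomial in the grammar is obtained by pushing a Gaussian draw through the softmax:

\[
\eta \sim \mathcal{N}(\mu, \Sigma), \qquad
\theta_k = \frac{\exp(\eta_k)}{\sum_{j=1}^{K} \exp(\eta_j)}, \quad k = 1, \dots, K.
\]

Unlike the conjugate Dirichlet, the covariance matrix Σ can encode correlations among rule probabilities, which is what the shared logistic normal extension exploits across the grammar's collection of multinomials; because the prior is non-conjugate, it is paired with a variational inference framework of the kind described above.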
Similar papers
Empirical Risk Minimization for Probabilistic Grammars: Sample Complexity and Hardness of Learning
Probabilistic grammars are generative statistical models that are useful for compositional and sequential structures. They are used ubiquitously in computational linguistics. We present a framework, reminiscent of structural risk minimization, for empirical risk minimization of probabilistic grammars using the log-loss. We derive sample complexity bounds in this framework that apply both to the...
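As an illustrative sketch of this setting (the notation here is ours, not the paper's): given sentences x_1, …, x_n drawn from an unknown distribution, empirical risk minimization with the log-loss selects grammar parameters θ from a class Θ by minimizing the average negative log-likelihood,

\[
\hat{\theta} = \operatorname*{arg\,min}_{\theta \in \Theta} \; \frac{1}{n} \sum_{i=1}^{n} -\log p_\theta(x_i),
\]

and a sample complexity bound states how large n must be so that, with high probability, the risk of \(\hat{\theta}\) is within ε of the best achievable in Θ.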
Covariance in Unsupervised Learning of Probabilistic Grammars
Probabilistic grammars offer great flexibility in modeling discrete sequential data like natural language text. Their symbolic component is amenable to inspection by humans, while their probabilistic component helps resolve ambiguity. They also permit the use of well-understood, general-purpose learning algorithms. There has been an increased interest in using probabilistic grammars in the Bayes...
On the Utility of Curricula in Unsupervised Learning of Probabilistic Grammars
We examine the utility of a curriculum (a means of presenting training samples in a meaningful order) in unsupervised learning of probabilistic grammars. We introduce the incremental construction hypothesis that explains the benefits of a curriculum in learning grammars and offers some useful insights into the design of curricula as well as learning algorithms. We present results of experiments...
Unambiguity Regularization for Unsupervised Learning of Probabilistic Grammars
We introduce a novel approach named unambiguity regularization for unsupervised learning of probabilistic natural language grammars. The approach is based on the observation that natural language is remarkably unambiguous in the sense that only a tiny portion of the large number of possible parses of a natural language sentence are syntactically valid. We incorporate an inductive bias into gram...
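One natural way to formalize such an unambiguity bias, given here only as a hedged sketch and not necessarily the authors' exact objective, is to penalize the entropy of the model's posterior over parse trees z, with a weight σ ≥ 0 (our notation):

\[
\max_{\theta} \; \sum_{i=1}^{n} \Big[ \log p_\theta(x_i) \;-\; \sigma \, H\big(p_\theta(z \mid x_i)\big) \Big],
\]

where H is Shannon entropy; driving the posterior entropy down concentrates probability mass on a few parses per sentence, mirroring the observation that only a tiny portion of the possible parses is syntactically valid.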
Constituent Structure for Filipino: Induction through Probabilistic Approaches
The current state of Philippine linguistic resources, which include formal grammars, electronic dictionaries, and corpora, is not yet sufficient to support industrial-strength language technologies. This paper discusses a computational approach to automatically estimating constituent structures from a corpus using unsupervised probabilistic approaches. Two models are presented and results show ...